Red Wine Quality Exploration

About this project

In this project, I analyze the correlations among the different variables and try to find which chemical properties influence the quality of red wines.

One variable analysis

Let’s focus on each factor and see what we have.

## [1] 1599
## [1] 13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

From this summary, you can see there are 1599 observations and 13 variables in our file: a row index (X), 11 chemical properties, and the quality score. Now, let’s look at some graphs.
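The output above looks like the result of checking the row and column counts and then calling str() and summary(). A minimal sketch of those steps follows. It uses only the first ten rows of four columns, copied from the str() output, so its numbers will not match the full summary; the file name in the comment is an assumption, not something stated in this report.

```r
# First ten rows only, copied from the str() output above.
# On the full data one would load the file instead, for example
# wine <- read.csv("wineQualityReds.csv")  (file name is an assumption).
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

n_rows <- nrow(wine_head)  # number of observations in the sample
n_cols <- ncol(wine_head)  # number of variables in the sample
print(n_rows)
print(n_cols)

str(wine_head)             # column types and first values
print(summary(wine_head))  # min, quartiles, mean and max per column
```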

From this graph, we can see that wine quality is concentrated at levels 5 and 6.
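One simple way to see where the quality scores concentrate is to count them. The sketch below does this for the ten quality values visible in the str() output; it is only an illustration, and ten rows cannot reproduce the distribution of all 1599 wines.

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count how many wines fall into each quality level.
counts <- table(quality)
print(counts)

# The level with the highest count.
most_common <- names(counts)[which.max(counts)]
print(most_common)

# barplot(counts, xlab = "quality", ylab = "number of wines")
# would draw the counts as a bar chart.
```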

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol:

The alcohol level of the red wines is concentrated around 9.5. The median alcohol level is 10.2 and the mean is 10.42.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density:

The density is very close to 1: the average is 0.9967, and there is only a very small difference between the minimum (0.9901) and the maximum (1.0037).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates:

Most sulfate levels are small, around 0.6 (the median is 0.62), and the average is 0.66. We can see that red wine generally contains very few sulfates.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Fixed.acidity and volatile.acidity:

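The curious thing is that fixed.acidity is roughly 16 times bigger than volatile.acidity on average (8.32 versus 0.53). Both are right-skewed, with the mean above the median and a long tail of high values. Whether the two are related to each other is examined in the many variable analysis.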
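The size ratio and the direction of skew can be checked from the printed summaries. The sketch below uses only the means and medians shown above, together with the rule of thumb that a mean above the median suggests a right (positive) skew. That rule is a heuristic, not a formal test, and the helper name skew_direction is made up for this illustration.

```r
# Median and mean taken from the two summaries printed above.
fixed    <- c(median = 7.90, mean = 8.32)
volatile <- c(median = 0.52, mean = 0.5278)

# How many times larger fixed.acidity is than volatile.acidity on average.
ratio_of_means <- fixed[["mean"]] / volatile[["mean"]]
print(round(ratio_of_means, 1))

# Rule of thumb: mean above median suggests a right skew,
# mean below median suggests a left skew.
skew_direction <- function(s) {
  if (s[["mean"]] > s[["median"]]) {
    "right"
  } else if (s[["mean"]] < s[["median"]]) {
    "left"
  } else {
    "none"
  }
}

print(skew_direction(fixed))
print(skew_direction(volatile))
```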

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric.acid:

Red wine contains very little citric.acid; the average is 0.27. But there is a big gap between the minimum (0) and the maximum (1), so it may be affected by other factors or may itself affect the quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual.sugar:

The sugar level of the red wines is generally low (the median is 2.2), but there is still a very big gap between the sweetest wine (15.5) and the least sweet wine (0.9).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides:

There are very few chlorides in the wine, so they might not be the factor that affects the quality the most.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Free.sulfur and total.sulfur:

The maximum total.sulfur reaches 289, which looks very high for a red wine given the median of 38, and the maximum free.sulfur (72) is also high compared with its median of 14. These two might be factors that affect the quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH:

The average pH of red wine is 3.3, which is quite acidic.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

correlation between quality and alcohol is 0.476
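The t statistic and the confidence interval in the output above follow directly from the correlation and the sample size. As a worked check, the sketch below rebuilds them from r = 0.4761663 and n = 1599, using the usual t formula for a Pearson correlation and the Fisher z transformation for the interval. It only re-derives the printed numbers and does not use the data itself.

```r
r  <- 0.4761663  # correlation reported by cor.test above
n  <- 1599       # number of wines
df <- n - 2      # degrees of freedom, 1597 as in the output

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95 percent confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t  =", round(t_stat, 3), "\n")
cat("CI =", round(ci, 4), "\n")
```

The same arithmetic applies to each of the other correlation outputs in this part.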

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

correlation between quality and density is -0.175

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

correlation between quality and sulfates is 0.251

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

correlation between quality and fixed.acidity is 0.124

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

correlation between quality and volatile.acidity is -0.391

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

correlation between quality and citric.acid is 0.226

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

correlation between quality and residual.sugar is 0.014

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

correlation between quality and chlorides is -0.129

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

correlation between quality and free.sulfur.dioxide is -0.051

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$total.sulfur.dioxid
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

correlation between quality and total.sulfur.dioxide is -0.185

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

correlation between quality and pH is -0.058

Observation and summary of this part

What is the dataset Structure?

This data set contains 1599 observations and 13 variables. Apart from the row index X and the quality column, the remaining 11 variables are chemical factors that could possibly affect red wine quality.

What are my main findings?

Quality of the red wine is the focus of our analysis, but most of the wines in our data set are of middling quality (levels 5 and 6). So far we have looked at the distribution, average, maximum and minimum of each factor, and at its correlation with quality. Alcohol (0.476) and volatile.acidity (-0.391) have the strongest relationships with quality, followed by sulfates, citric.acid, total.sulfur.dioxide, density and chlorides, whose correlations are weaker (between about 0.13 and 0.25 in absolute value). We will not analyze the relationship between quality and pH, sugar level or free.sulfur.dioxide any further, since their correlations are almost equal to 0.
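To compare the factors at a glance, the correlations reported by the cor.test outputs above can be collected and ranked by absolute value. The sketch below does only that; the values are copied from the outputs, and the 0.1 cut-off used to flag near-zero correlations is an arbitrary choice for illustration.

```r
# Correlation of each chemical variable with quality,
# copied from the cor.test outputs above.
r <- c(
  alcohol              =  0.4761663,
  density              = -0.1749192,
  sulphates            =  0.2513971,
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  pH                   = -0.05773139
)

# Rank from strongest to weakest, ignoring the sign.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 3))

# Variables whose correlation with quality is close to zero
# (the 0.1 threshold is an assumption made for this sketch).
near_zero <- names(r)[abs(r) < 0.1]
print(near_zero)
```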

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Besides the main factors, alcohol and volatile.acidity, I think citric.acid, sulfates, total.sulfur.dioxide, density and chlorides might also help support my investigation.

Did you create any new variables from existing variables in the dataset?

I did not; so far I do not think it is necessary. I might create one later.

Two variable analysis

Relationship between quality and alcohol

It is easy to see that there is a positive correlation between alcohol and quality: red wines with more alcohol tend to have higher quality.
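A numeric companion to such a plot is the average alcohol level within each quality score. The sketch below computes it for the ten rows visible in the str() output. Ten rows are far too few to show the trend reliably, so this only illustrates the mechanics; on the full data the same call would use all 1599 wines.

```r
# First ten wines, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Mean alcohol level for each quality score.
mean_alcohol <- tapply(alcohol, quality, mean)
print(round(mean_alcohol, 2))

# Pearson correlation on this small sample (not the 0.476 of the full data).
sample_cor <- cor(alcohol, quality)
print(round(sample_cor, 3))

# boxplot(alcohol ~ quality) would show the same comparison as a plot.
```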

Relationship between quality and volatile.acidity

It is also clear that volatile.acidity has a negative relationship with quality: the higher the volatile.acidity, the lower the quality of the red wine tends to be.

Relationship between quality and sulphates

The correlation between sulfates and quality is lower than for the previous two factors, but you can still see a positive relationship: the higher the sulfates, the better the quality of the wine tends to be.

Relationship between quality and citric.acid

The same conclusion applies here: citric.acid does have a positive relationship with quality.

Relationship between quality and total.sulfur.dioxide

Observation and summary of this part

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As you can see in these graphs and analyses, it is easy to tell that alcohol and volatile.acidity have the clearest relationships with quality. The more alcohol the wine contains, the better its quality tends to be, and the more volatile acidity the wine has, the worse its quality tends to be.

Compared with these factors, citric.acid, sulfates, total.sulfur.dioxide, density and chlorides have weaker relationships with quality.

What was the strongest relationship you found?

The strongest relationships I found with quality are those of alcohol and volatile.acidity.

Many variable analysis

We have found that alcohol and volatile.acidity seem to have the strongest relationships with quality, but we would still like to go deeper and see whether other factors affect quality as well. I am also interested in knowing what is related to alcohol and volatile.acidity themselves, so we will look at the relationships of the other factors with these two.
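Before plotting, one quick way to screen which factors move together with alcohol and volatile.acidity is a small correlation table. The sketch below builds one from the first ten rows shown in the str() output, so its values are illustrative only and should not be read as results for the full data set.

```r
# First ten rows of five columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# The two variables of interest and every other column in the sample.
targets <- c("alcohol", "volatile.acidity")
others  <- setdiff(names(wine_head), targets)

# Rows: other factors. Columns: alcohol and volatile.acidity.
cor_with_targets <- cor(wine_head[others], wine_head[targets])
print(round(cor_with_targets, 2))
```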

Alcohol

Relationship between alcohol, density and quality

We can clearly notice a strong negative relationship between the density of red wine and alcohol. You can also see that the high-quality red wines are concentrated in the low-density, high-alcohol area.
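A three-variable view like this one is typically a scatter plot of two chemical variables with the points coloured by quality. The sketch below shows one way to do that in base R. The helper names quality_colours and plot_by_quality are hypothetical, the blue-to-red scale is an arbitrary choice, and this is not necessarily how the original plots were produced.

```r
# Hypothetical helper: one colour per quality level, low = blue, high = red.
quality_colours <- function(quality) {
  levels_q <- sort(unique(quality))
  shades <- colorRampPalette(c("blue", "red"))(length(levels_q))
  names(shades) <- levels_q
  shades[as.character(quality)]
}

# Hypothetical helper: scatter plot of two columns, coloured by quality.
plot_by_quality <- function(df, x, y) {
  cols <- quality_colours(df$quality)
  plot(df[[x]], df[[y]], col = cols, pch = 19, xlab = x, ylab = y)
  legend("topright", title = "quality", pch = 19,
         legend = sort(unique(df$quality)),
         col = unique(cols[order(df$quality)]))
}

# Colours assigned to the ten quality scores shown in the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
point_colours <- quality_colours(quality)
print(table(point_colours))
```

With the full data frame loaded as wine, a call such as plot_by_quality(wine, "alcohol", "density") would draw alcohol against density with one colour per quality level.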

Relationship between alcohol, sulphates and quality

Weak relationship.

Relationship between alcohol, citric.acid and quality

Weak relationship.

Relationship between alcohol, total.sulfur.dioxide and quality

Weak relationship.

Relationship between alcohol, residual.sugar and quality

Almost no relationship

Relationship between alcohol, pH and quality

Weak relationship.

Conclusion:

As we can see so far, the strongest relationship of alcohol is with the density of the red wine.

volatile.acidity

Relationship between volatile.acidity, fixed.acidity and quality

We can see that the two acidities have a fairly strong negative relationship: higher fixed.acidity goes with lower volatile.acidity. Both also relate to the quality: the better-quality wines are concentrated in the high fixed.acidity, low volatile.acidity area.

Relationship between volatile.acidity, density and quality

No relationship

Relationship between volatile.acidity, sulphates and quality

There is some relationship, but it is not very strong.

Relationship between volatile.acidity, citric.acid and quality

These factors also seem to have a fairly strong relationship: the higher the citric acid, the lower the volatile acidity. The higher-quality wines tend to be in the high citric acid, low volatile acidity area.

Relationship between volatile.acidity, residual.sugar and quality

No relationship

Relationship between volatile.acidity, free.sulfur.dioxide and quality

No relationship

Conclusion:

As we can see so far, the strongest relationships of volatile.acidity are with citric.acid and fixed.acidity.

Observation and summary of this part

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Yes, definitely. As analyzed in this part, alcohol has a strong relationship with the density of red wine, and volatile.acidity has strong relationships with citric.acid and fixed.acidity.

Were there any interesting or surprising interactions between features?

Yes. At the beginning I thought alcohol and volatile.acidity were the only factors that strongly affect wine quality. But after plotting the relationships of the other factors with alcohol and volatile.acidity, I found that many factors are actually related to one another, and that taken together they can tell us even more about wine quality.

Final Plots and Summary

As we can see from these graphs, it is quite clear that alcohol has the strongest positive relationship with quality. The higher the alcohol level of the wine, the better its quality tends to be.

As we can see from these graphs, it is quite clear that volatile.acidity has the strongest negative relationship with quality. The less volatile.acidity the wine has, the better its quality tends to be.

We can also notice that density is the factor most strongly associated with alcohol. Taken together, the two tell us more about the wine quality.

The better wines have higher alcohol and lower density.

Fixed.acidity is one of the factors most strongly associated with the level of volatile.acidity, and the correlation between them is negative: the more volatile.acidity, the lower the fixed.acidity.

It also relates to the wine quality: good-quality wines have low volatile.acidity and high fixed.acidity.

Citric acid is the second factor strongly associated with the level of volatile.acidity, and the correlation between them is negative: the more volatile.acidity, the less citric acid.

It also relates to the wine quality: good-quality wines have low volatile.acidity and high fixed.acidity and citric acid.

Reflection

I started by analyzing the quality distribution to see where most of the wine samples are and what quality they have. Then I started to think about the relationship between the quality and the other factors. I first used a quick and easy way to look at them directly, and then I used cor.test. I first excluded the few factors that have almost no relationship with quality, so that I could focus on the ones with stronger relationships. Then, step by step, I found deeper factors that can influence the quality.

In this analysis process, I used a lot of my knowledge of R and also tried to work out by myself how to start, how to plot, how to analyze, and what my point was. That makes me feel really good!

My struggles were at the beginning: how to start this project, what the main point of the analysis was, and what I wanted to achieve. But later I figured out that the main point is the quality and how the other factors affect it.

For future work, I would think of even more questions about the data set. For example, which factors do not affect the wine quality? What would be the perfect wine for health seekers? What are the levels of the chemical factors for a level-10 quality wine? I think we will have a chance to dive deeper into those questions in the future.